First-Order Methods

Spoiler alert: Some methods are trash!


Gradient Descent


We can choose the direction of steepest descent as the descent direction d; it points in the direction opposite to the gradient:

$$d^{(k)} = -\frac{g^{(k)}}{\lVert g^{(k)} \rVert}$$

It can be shown (proof left to the reader as an exercise) that with exact line search each descent direction is orthogonal to the previous one.
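
As a sketch, here is a plain fixed-step variant of gradient descent in Julia. The quadratic test function, step size, and iteration count are illustrative choices of mine; the notebook's version picks the step with a line search instead.

```julia
# Plain gradient descent on f(x, y) = x^2 + 10y^2 (gradient below).
# α and the iteration count are illustrative, not the notebook's settings.
∇f(x) = [2x[1], 20x[2]]

function gradient_descent(∇f, x0; α = 0.05, iters = 300)
    x = copy(x0)
    for _ in 1:iters
        x -= α * ∇f(x)   # step along the negative gradient
    end
    return x
end

x_min = gradient_descent(∇f, [1.0, 1.0])
```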


The optimal point is near (0.9605721406277097, 0.9196323117617097)

The optimized value is 0.00160157399060238


The optimal point is near (0.7414365199553092, 0.6976297924702967) 👀

The optimized value is 0.17622960698092363 😒🤢

Gradient descent (alone) is not miraculous 😢

Conjugate Gradient

Gradient descent is 🗑️ in narrow valleys. The conjugate gradient method instead uses a local quadratic approximation of the function to pick better search directions.

Conjugate gradient descent is GUARANTEED to minimize an n-dimensional quadratic function in n steps.

This method can also be applied to nonquadratic functions: smooth functions behave approximately quadratically close to a local minimum (uh-oh).

This means we need to get close to the minimum before we see the benefit. 🗑️

This method uses information from the previous gradients to improve the next search direction:

$$d^{(k+1)} = -g^{(k+1)} + \beta^{(k)} d^{(k)}$$

Fletcher–Reeves: $$\beta^{(k)} = \frac{(g^{(k)})^\top g^{(k)}}{(g^{(k-1)})^\top g^{(k-1)}}$$

Polak–Ribière: $$\beta^{(k)} = \frac{(g^{(k)})^\top \left(g^{(k)} - g^{(k-1)}\right)}{(g^{(k-1)})^\top g^{(k-1)}}$$

👀 For guaranteed convergence with the Polak–Ribière method we need to reset β to 0 whenever it becomes negative.
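
A sketch of conjugate gradient descent with the Polak–Ribière β and the reset-to-zero safeguard. The matrix H, the start point, and the exact line-search formula (valid for quadratics f(x) = x'Hx/2) are illustrative choices of mine, not the notebook's code.

```julia
using LinearAlgebra

# Conjugate gradient with Polak–Ribière β on f(x) = x' * H * x / 2.
function cg_polak_ribiere(H, x0; steps = length(x0))
    x = copy(x0)
    g = H * x                           # gradient of f(x) = x'Hx/2
    d = -g
    for _ in 1:steps
        norm(g) < 1e-12 && break        # already converged
        α = -dot(g, d) / dot(d, H * d)  # exact line search on a quadratic
        x += α * d
        g_new = H * x
        β = dot(g_new, g_new - g) / dot(g, g)  # Polak–Ribière
        β = max(β, 0.0)                        # reset to 0 if negative
        d = -g_new + β * d
        g = g_new
    end
    return x
end

x_cg = cg_polak_ribiere([3.0 1.0; 1.0 2.0], [2.0, -1.5])
# An n-dimensional quadratic is minimized in n steps (here n = 2).
```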


The optimal point is near (0.9926081864948202, 0.9847483130588324) 🌟

The optimized value is 5.6004977270041056e-5 🥳

Momentum

The idea is to simulate a force that updates a velocity, to avoid slow movement on nearly flat surfaces (where gradients with small magnitude would fail miserably).

$$v^{(k+1)} = \beta v^{(k)} - \alpha g^{(k)}$$

$$x^{(k+1)} = x^{(k)} + v^{(k+1)}$$
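
The two updates above can be sketched in a few lines; the quadratic test function, α, β, and the iteration count are illustrative choices of mine.

```julia
# Momentum on f(x, y) = x^2 + 10y^2; constants are illustrative.
∇f(x) = [2x[1], 20x[2]]

function momentum(∇f, x0; α = 0.02, β = 0.9, iters = 200)
    x, v = copy(x0), zeros(length(x0))
    for _ in 1:iters
        v = β * v - α * ∇f(x)   # velocity update
        x = x + v               # position update
    end
    return x
end

x_mom = momentum(∇f, [1.0, 1.0])
```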


The optimal point is near (1.0263932677227146, 1.304148459285668) 🤷‍♀️

The optimized value is 6.28400683251055 🤦‍♂️

It's like the point is falling straight to the minimum 😲

But... it might go too fast

Nesterov Momentum

If it goes too fast, try predicting how fast it will be going at the next iteration by evaluating the gradient at the look-ahead point...

$$v^{(k+1)} = \beta v^{(k)} - \alpha \nabla f\left(x^{(k)} + \beta v^{(k)}\right)$$
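
The same sketch as for momentum, except that the gradient is evaluated at the predicted (look-ahead) point x + βv. Test function and constants are again illustrative choices of mine.

```julia
# Nesterov momentum on f(x, y) = x^2 + 10y^2; constants are illustrative.
∇f(x) = [2x[1], 20x[2]]

function nesterov_momentum(∇f, x0; α = 0.02, β = 0.9, iters = 200)
    x, v = copy(x0), zeros(length(x0))
    for _ in 1:iters
        v = β * v - α * ∇f(x + β * v)   # gradient at the look-ahead point
        x = x + v
    end
    return x
end

x_nest = nesterov_momentum(∇f, [1.0, 1.0])
```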


The optimal point is near (0.9999678362763855, 0.9999355449078215) 🤔

The optimized value is 1.03616095695321e-9 🤔🤔

It's still going too fast...

Adagrad

Adaptive subgradient method.

This method adapts the learning rate for each component individually.

$$x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha}{\epsilon + \sqrt{s_i^{(k)}}}\, g_i^{(k)}$$

$$s_i^{(k)} = \sum_{j=1}^{k} \left(g_i^{(j)}\right)^2$$

ϵ is a small value (e.g. 1e-8) that prevents division by zero.

α is usually set to 0.01; since every component's step is rescaled anyway, its exact value doesn't matter much.

Problem: the components of s grow monotonically, so the effective learning rate keeps decreasing (and can become infinitesimally small even before convergence).
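
A per-component Adagrad sketch of the updates above. The test function and constants are illustrative choices of mine (α is larger than the usual 0.01 so the toy problem converges in few iterations).

```julia
# Adagrad on f(x, y) = x^2 + 10y^2; constants are illustrative.
∇f(x) = [2x[1], 20x[2]]

function adagrad(∇f, x0; α = 0.5, ϵ = 1e-8, iters = 500)
    x = copy(x0)
    s = zeros(length(x0))               # running sum of squared gradients
    for _ in 1:iters
        g = ∇f(x)
        s += g .^ 2
        x -= α * g ./ (ϵ .+ sqrt.(s))   # per-component scaled step
    end
    return x
end

x_ada = adagrad(∇f, [1.0, 1.0])
```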


The optimal point is near (-0.3043008040570129, 0.09737789529620958) 🤣

The optimized value is 1.7034843912261084 🐌🐌🐌

Really slow convergence...

RMSProp

It comes to Adagrad's rescue, solving the problem of the ever-decreasing learning rate.

⊙ (\odot) denotes element-wise multiplication.

$$\hat{s}^{(k+1)} = \gamma \hat{s}^{(k)} + (1 - \gamma)\left(g^{(k)} \odot g^{(k)}\right)$$

γ is typically close to 0.9.

$$\mathrm{RMS}(g_i) = \sqrt{\hat{s}_i^{(k+1)}}$$
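
An RMSProp sketch: Adagrad's growing sum is replaced by the decaying average ŝ above. Test function and constants are illustrative choices of mine.

```julia
# RMSProp on f(x, y) = x^2 + 10y^2; constants are illustrative.
∇f(x) = [2x[1], 20x[2]]

function rmsprop(∇f, x0; α = 0.01, γ = 0.9, ϵ = 1e-8, iters = 1000)
    x = copy(x0)
    ŝ = zeros(length(x0))               # decaying average of squared gradients
    for _ in 1:iters
        g = ∇f(x)
        ŝ = γ * ŝ + (1 - γ) * (g .* g)  # element-wise g ⊙ g
        x -= α * g ./ (ϵ .+ sqrt.(ŝ))
    end
    return x
end

x_rms = rmsprop(∇f, [1.0, 1.0])
```

With a constant α the iterates settle into a small neighborhood of the minimum (the effective step near convergence is roughly α) rather than converging exactly.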


Adadelta

Removes the learning rate parameter entirely (done by the same authors as RMSProp) 😲😲

$$x_i^{(k+1)} = x_i^{(k)} - \frac{\mathrm{RMS}(\Delta x_i)}{\epsilon + \mathrm{RMS}(g_i)}\, g_i^{(k)}$$
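
An Adadelta sketch of the update above. Note that I also add ϵ to the numerator, otherwise the very first step would be zero (RMS(Δx) starts at 0); ϵ here is larger than the usual 1e-8 to shorten Adadelta's slow start, and the test function and constants are illustrative choices of mine.

```julia
# Adadelta on f(x, y) = x^2 + 10y^2; constants are illustrative.
∇f(x) = [2x[1], 20x[2]]

function adadelta(∇f, x0; γ = 0.9, ϵ = 1e-4, iters = 2000)
    x = copy(x0)
    ŝ = zeros(length(x0))   # decaying average of squared gradients
    û = zeros(length(x0))   # decaying average of squared updates
    for _ in 1:iters
        g = ∇f(x)
        ŝ = γ * ŝ + (1 - γ) * (g .* g)
        Δx = -(sqrt.(û) .+ ϵ) ./ (sqrt.(ŝ) .+ ϵ) .* g   # no α anywhere
        û = γ * û + (1 - γ) * (Δx .* Δx)
        x += Δx
    end
    return x
end

x_add = adadelta(∇f, [1.0, 1.0])
```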


Adam

Adaptive moment estimation method.

It stores an exponentially decaying squared gradient (like RMSProp and Adadelta), and also an exponentially decaying gradient, like momentum.

Initializing the gradient and squared gradient to zero introduces a bias, which is corrected below. (Good defaults are α = 0.001, γᵥ = 0.9, γₛ = 0.999, and ϵ = 1e-8.)

The equations for Adam are:

  • Biased decaying momentum: $v^{(k+1)} = \gamma_v v^{(k)} + (1 - \gamma_v)\, g^{(k)}$

  • Biased decaying sq. gradient: $s^{(k+1)} = \gamma_s s^{(k)} + (1 - \gamma_s)\left(g^{(k)} \odot g^{(k)}\right)$

  • Corrected decaying momentum: $\hat{v}^{(k+1)} = v^{(k+1)} / (1 - \gamma_v^k)$

  • Corrected decaying sq. gradient: $\hat{s}^{(k+1)} = s^{(k+1)} / (1 - \gamma_s^k)$

  • Next iterate: $x^{(k+1)} = x^{(k)} - \alpha\, \hat{v}^{(k+1)} \big/ \left(\epsilon + \sqrt{\hat{s}^{(k+1)}}\right)$
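
The five updates above can be sketched as follows. The test function and iteration count are illustrative choices of mine, and α is larger than the default 0.001 so the toy problem converges quickly.

```julia
# Adam on f(x, y) = x^2 + 10y^2; α and iters are illustrative.
∇f(x) = [2x[1], 20x[2]]

function adam(∇f, x0; α = 0.01, γv = 0.9, γs = 0.999, ϵ = 1e-8, iters = 2000)
    x = copy(x0)
    v = zeros(length(x0))   # decaying momentum
    s = zeros(length(x0))   # decaying squared gradient
    for k in 1:iters
        g = ∇f(x)
        v = γv * v + (1 - γv) * g
        s = γs * s + (1 - γs) * (g .* g)
        v̂ = v / (1 - γv^k)   # bias corrections
        ŝ = s / (1 - γs^k)
        x -= α * v̂ ./ (ϵ .+ sqrt.(ŝ))
    end
    return x
end

x_adam = adam(∇f, [1.0, 1.0])
```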


Hypergradient Descent

These methods are all too sensitive to the learning rate... let's optimize it first.

We applied gradient descent to the learning rate 🤯🤯
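
A sketch of the idea: plain gradient descent whose learning rate α is itself updated by a gradient step, using the fact that the hypergradient of f(x⁽ᵏ⁾) with respect to α is −g⁽ᵏ⁾ ⋅ g⁽ᵏ⁻¹⁾. The test function f(x) = x², the hyper-learning-rate μ, and all constants are illustrative choices of mine.

```julia
# Hypergradient descent on f(x) = x² (1-D for simplicity).
f′(x) = 2x

function hypergradient_descent(f′, x0; α = 1e-3, μ = 1e-4, iters = 500)
    x, g_prev = x0, 0.0
    for _ in 1:iters
        g = f′(x)
        α += μ * g * g_prev   # α grows while successive gradients agree
        x -= α * g
        g_prev = g
    end
    return x
end

x_hg = hypergradient_descent(f′, 1.0)
```

The feedback is automatic: once the method overshoots, successive gradients disagree in sign, so the hypergradient step shrinks α again.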


Let's do the same to Nesterov momentum


Exercises

Exercise 5.1. Compute the gradient of $x^\top A x + b^\top x$ when A is symmetric.

Answer: expanding the quadratic form as a (really long) sum, differentiating term by term, and collecting back into matrix form gives $\nabla f = 2Ax + b$, using the symmetry $A = A^\top$.
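
A quick numerical sanity check of this answer (my own sketch, not the notebook's code) via central differences; A, b, and x are arbitrary illustrative values.

```julia
using LinearAlgebra

# Check that ∇(xᵀAx + bᵀx) = 2Ax + b for a symmetric A.
A = [2.0 1.0; 1.0 3.0]
b = [1.0, -1.0]
x = [0.5, -0.25]
f(x) = x' * A * x + b' * x

h = 1e-6
num_grad = map(1:2) do i
    e = zeros(2); e[i] = 1.0
    (f(x + h * e) - f(x - h * e)) / (2h)   # central difference
end
analytic = 2A * x + b
```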

Exercise 5.2. Apply gradient descent with a unit step to f(x) = x⁴ and compute two iterations.
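
The two iterations, worked out in code. The start x⁰ = 1 is my assumption (the text above doesn't state it); with a unit step the iterates diverge rather than converge.

```julia
# Two gradient descent iterations with unit step on f(x) = x^4.
f′(x) = 4x^3

x0 = 1.0
x1 = x0 - f′(x0)   # 1 - 4 = -3
x2 = x1 - f′(x1)   # -3 - (-108) = 105
```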


Exercise 5.3. Apply one step of gradient descent to f(x) = eˣ + e⁻ˣ from x = 10, with both a unit step and exact line search.


Answer: with a unit step the gradient really does explode: f′(10) = e¹⁰ − e⁻¹⁰ ≈ 22026, so a single unit step lands near x ≈ −22016, where the gradient is astronomically large. With exact line search there is no problem: the minimum along the descent ray is the global minimizer x = 0, reached in one step.
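
Checking this numerically (my own sketch; only the start point x = 10 comes from the exercise). Since f(x) = eˣ + e⁻ˣ is minimized at x = 0, the descent ray from x = 10 passes straight through the minimizer.

```julia
# Exercise 5.3: unit step vs. exact line search on f(x) = exp(x) + exp(-x).
f(x) = exp(x) + exp(-x)
f′(x) = exp(x) - exp(-x)

x0 = 10.0
x_unit = x0 - f′(x0)       # ≈ -22016: a huge overshoot past the minimum

# Exact line search: minimize f(x0 - t * f′(x0)) over t ≥ 0. The minimizer
# of f is x = 0, which lies on this ray, so the optimal t is 10 / f′(10).
t_star = x0 / f′(x0)
x_exact = x0 - t_star * f′(x0)   # ≈ 0
```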

Exercise 5.4. The conjugate gradient method can be used to find a search direction d when a local quadratic model of a function is available at the current point. With d as the search direction, let the model be $q(d) = d^\top H d + b^\top d + c$ for a symmetric matrix H. What is the Hessian in this case? What is the gradient of q when d = 0? What can go wrong if the conjugate gradient method is applied to the quadratic model to get the search direction d?

Answer: The gradient is $\nabla q = 2Hd + b$, so the Hessian (the gradient of the gradient) is $\nabla^2 q = 2H$.

When d = 0 the gradient reduces to b. What can go wrong: if H is not positive definite the quadratic model has no minimizer, and once the gradients become tiny the method divides by near-zero quantities when computing the next search direction. (This happened to me when using too many iterations.)

Exercise 5.5. How is Nesterov momentum an improvement over momentum?

Answer: The problem with momentum is that it doesn't slow down. Nesterov momentum evaluates the gradient at the predicted next position, so if the point overshoots, that gradient pushes it back, correcting the step.

Exercise 5.6. In what way is the conjugate gradient method an improvement over steepest descent?

Answer: Gradient descent is really slow in narrow valleys, because each new direction is orthogonal to the previous one, causing a zig-zag movement. The conjugate gradient method corrects this by letting the previous direction contribute to the next one, producing a much straighter path to the minimum.

Exercise 5.7. In conjugate gradient descent, what is the normalized descent direction at the first iteration for the function f(x, y) = x² + xy + y² + 5 when initialized at (x, y) = (1, 1)? What is the resulting point after two steps of the conjugate gradient method?


Answer: The gradient at (1, 1) is (3, 3), so the unnormalized descent direction is [−3, −3] (the computed value [−2.9999999999239875, −2.9999999999239875] agrees, with the same entry in each component); normalizing gives $(-1/\sqrt{2}, -1/\sqrt{2}) \approx (-0.707, -0.707)$. After two iterations the point is [1.0056882026629292e-6, 1.0056882026629292e-6], essentially the true minimizer (0, 0). This is expected, because the function is a polynomial of order two and conjugate gradient optimizes an n-dimensional quadratic in n steps.
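
A quick check of this answer (my own sketch, using the exact line-search step for a quadratic). The gradient of the quadratic part of f(x, y) = x² + xy + y² + 5 is H·x with H below; the +5 constant plays no role.

```julia
using LinearAlgebra

# Two conjugate gradient steps on f(x, y) = x² + xy + y² + 5.
function cg_quadratic(H, x0; steps = 2)
    x = copy(x0)
    g = H * x                          # gradient of the quadratic part
    d = -g
    for _ in 1:steps
        norm(g) < 1e-12 && break       # already at the minimizer
        α = dot(g, g) / dot(d, H * d)  # exact line search step
        x += α * d
        g_new = H * x
        β = dot(g_new, g_new) / dot(g, g)  # Fletcher–Reeves
        d = -g_new + β * d
        g = g_new
    end
    return x
end

x_star = cg_quadratic([2.0 1.0; 1.0 2.0], [1.0, 1.0])
# In exact arithmetic the very first step already lands on (0, 0) here,
# because (1, 1) is an eigenvector of the Hessian.
```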

Exercise 5.8. We have a polynomial function f such that f(x) > 2 for all x in three-dimensional Euclidean space. Suppose we are using steepest descent with step lengths optimized at each step, and we want to find a local minimum of f. If our unnormalized descent direction is [1,2,3] at step k, is it possible for our unnormalized descent direction at step k+1 to be [0,0,-3]? Why or why not?

Answer: With exact line search, each steepest-descent direction is orthogonal to the next. Since $[1, 2, 3] \cdot [0, 0, -3] = -9 \neq 0$, the unnormalized descent direction at step k+1 cannot be [0, 0, −3]. (The condition f(x) > 2 presumably just ensures f is bounded below, so the exact line searches are well defined; it doesn't change the orthogonality argument.)
